Chinese Short Text Classification Based on Domain Knowledge

Authors

  • Xiao Feng
  • Yang Shen
  • Chengyong Liu
  • Wei Liang
  • Shuwu Zhang
Abstract

People are generating more and more short texts, and there is an urgent demand to classify them into different domains. Because short texts are both brief and sparse, conventional methods based on the Vector Space Model (VSM) have limitations. To tackle the data scarcity problem, we propose a new model that directly measures the correlation between a short text instance and a domain instead of representing short texts as vectors of weights. First, we draw domain knowledge for each user-defined domain from an external corpus of longer documents. Second, we calculate the correlation by measuring the proportion of the instance that overlaps with the domain knowledge. Finally, if the correlation is greater than a threshold, the instance is classified into the domain. Experimental results show that the classifier based on the proposed model outperforms the state-of-the-art baselines based on VSM.
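The three steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how domain knowledge is extracted or how overlap is counted, so the sketch assumes that domain knowledge is simply the most frequent tokens in the external corpus, that texts are already segmented into word tokens, and that overlap is the fraction of an instance's tokens found in the domain knowledge. The function names, `top_k`, and the default `threshold` are all hypothetical.

```python
from collections import Counter


def build_domain_knowledge(domain_docs, top_k=1000):
    """Step 1 (assumed): keep the top_k most frequent tokens of a domain.

    domain_docs is a list of longer documents, each already tokenized
    into a list of words (Chinese text would need word segmentation first).
    """
    counts = Counter(tok for doc in domain_docs for tok in doc)
    return {tok for tok, _ in counts.most_common(top_k)}


def correlation(tokens, knowledge):
    """Step 2 (assumed): share of the instance's tokens found in the domain knowledge."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in knowledge)
    return hits / len(tokens)


def classify(tokens, domains, threshold=0.5):
    """Step 3: return every domain whose correlation exceeds the threshold."""
    return [
        name
        for name, knowledge in domains.items()
        if correlation(tokens, knowledge) > threshold
    ]
```

Because each domain is tested independently against the threshold, a short text may be assigned to several domains or to none, which is consistent with the abstract's description of a per-domain decision.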


Similar resources

Some Studies on Chinese Domain Knowledge Dictionary and Its Application to Text Classification

In this paper, we study some issues concerning a Chinese domain knowledge dictionary and its application to the text classification task. First, a domain knowledge hierarchy description framework and our Chinese domain knowledge dictionary, named NEUKD, are introduced. Second, to reduce the cost of constructing a domain knowledge dictionary by hand, we use a bootstrapping-based algorithm to learn new domain ...


Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...


Topic Modeling and Classification of Cyberspace Papers Using Text Mining

The global cyberspace networks provide individuals with platforms to interact, exchange ideas, share information, provide social support, conduct business, create artistic media, play games, engage in political discussions, and much more. The term cyberspace has become a conventional means to describe anything associated with the Internet and the diverse Internet culture. In fact, cyberspac...


Research on Domain Knowledge Based Chinese Short Message Understanding Model

With the rapid growth of the short message service in China, information query technology based on Chinese natural language is becoming a research hotspot. An algorithm for Chinese natural language understanding based on certain domain knowledge is proposed and applied to a Chinese short-message-based information query system. The algorithm is divided into three i...


A New Method for Sentiment Classification in Text Retrieval

Traditional text categorization is usually a topic-based task, but a subtler demand in information retrieval is to distinguish between positive and negative views on a text's topic. In this paper, a new method is explored to solve this problem. Firstly, a batch of Concerned Concepts in the researched domain is predefined. Secondly, the special knowledge representing the positive or negative context o...




Publication date: 2013